Abstract


Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure
that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory
(DSM) systems is expensive because of the high frequency of communication. In this paper we show that,
because of information redundancy, not all message-passing dependences need to be considered to roll
back to a consistent state in DSM systems, resulting in reduced dependency tracking overhead and reduced
potential for rollback propagation. We develop a model of execution where client processes running an
application interact atomically with a set of shared-memory server processes on every access to shared
data. We show that under this model, dependences are significantly reduced over the message-passing
model. We use results from simulations with multiprocessor address traces to demonstrate the reduction in
dependences.



This research was supported in part by the Office of Naval Research under contract N00014-91-J-1283, and by the National
Aeronautics and Space Administration (NASA) under Grant NASA NAG 1-413, in cooperation with the Illinois Computer
Laboratory for Aerospace Systems and Software (ICLASS).


DISTRIBUTION  STATEMENT  A;  APPROVED  FOR  PUBLIC  RELEASE:  DISTRIBUTION  IS  UNLIMITED 



1  Introduction
Checkpointing and rollback recovery techniques are extensively used to allow a computation to make
progress in spite of failures. A large body of research exists on the application of checkpointing and
rollback recovery to message-passing systems with an emphasis on maintaining consistency without allowing
excessive rollback propagation. The problem of maintaining consistency in recoverable shared memory
systems is not as well understood. Distributed shared memory (DSM) systems use cache coherence
hardware [13] or shared virtual memory software [14] to provide a shared memory image on top of
message passing. In DSM systems, communication frequency is much higher than in pure message-passing
systems. Therefore, using traditional message-passing recovery techniques incurs a high overhead
in tracking dependences between processes. Furthermore, the large number of dependences increases the
likelihood of the need to propagate rollbacks to maintain consistency.

In this paper, we show that, by considering information redundancies in DSM systems, the number of
dependences, and therefore the overhead of maintaining consistency with rollbacks, can be significantly
reduced. We model a DSM system as a set of client processes running an application program that interact
atomically with a set of shared-memory server processes on every access to shared data. We show that,
under this model, many of the messages transmitted during interactions do not result in dependences between
processors, and therefore do not have to be considered when rolling back to a consistent state. We back our
claims with results from simulations using multiprocessor address traces.

Dependences carried by messages have to be considered in any approach to checkpointing and rollback
recovery for distributed systems. Even in fully coordinated checkpointing [6, 7, 12, 15], where all processes
synchronize to take a global checkpoint, dependences may cause additional overhead by aborting tentative
checkpoints. The coordination algorithm can be improved by adding dependency tracking. Then only
processes that have dependences in the current checkpoint interval need to synchronize checkpointing and
rollback [12]. Due to process independence, recovery efficiency, or I/O bandwidth requirements it may not
be desirable to synchronize checkpoints. Independent checkpointing replaces synchronization by dependency
tracking and/or message logging, both of which introduce overhead for every dependence-carrying
message [5, 8, 18, 20]. Every dependence-carrying message also introduces a chance of rollback propagation,
which can escalate into the domino effect. Some schemes use communication-induced checkpointing to
bound rollback propagation, where messages can induce checkpoints on other processors. In these schemes,
the large overhead of taking checkpoints is affected by the pattern of dependences.

Various distributed system recovery techniques have been applied to shared memory. Communication-induced
checkpointing is used in the Sequoia system [4], by Wu et al. in both bus-based multiprocessors [22]
and software DSM systems [23], and by Janssens and Fuchs in DSM systems with relaxed consistency [11].
Ahmed et al. presented three schemes for bus-based systems that use fully coordinated, partially coordinated
and communication-induced checkpointing respectively [2]. Banâtre et al. have proposed a scheme that
uses dependency tracking at the shared memory in a bus-based system to implement partially coordinated
checkpointing [3]. Richard and Singhal have proposed using independent checkpointing and logging of all
memory accesses to implement recovery in piecewise deterministic DSM systems [17].

It is obvious that the dependences of message-passing are too strict for shared-memory parallel programs.
For instance, two reads by different processes to a shared variable with no intervening writes do not depend
on each other even though both processes exchange messages with the shared memory element. In the
literature on replay for debugging in shared memory systems, a dependence from memory access a to
memory access b is generally said to exist if a accesses a shared variable that b later accesses, and at least
one of the two accesses is a write [16]. Gunaseelan and LeBlanc have recently argued that a write causes a
two-way dependence with a memory element, while a read only causes a dependence from the memory to
the process [10]. Therefore there is no dependence from a read to a write if the read precedes the write.
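The difference between the two notions of dependence can be illustrated with a short sketch. The trace format and both function names are hypothetical, chosen only for this illustration, and the second function encodes our reading of the relation described above rather than any published implementation. Only dependences between different processes are reported.

```python
from itertools import combinations

# Hypothetical trace format: (process, operation, variable), in global order.
trace = [
    ("P1", "read", "x"),
    ("P2", "read", "x"),
    ("P1", "write", "x"),
    ("P2", "read", "x"),
]

def replay_dependences(trace):
    """Dependence a -> b if a precedes b, both access the same variable,
    and at least one of the two accesses is a write."""
    deps = set()
    for (i, a), (j, b) in combinations(enumerate(trace), 2):
        if a[0] != b[0] and a[2] == b[2] and "write" in (a[1], b[1]):
            deps.add((i, j))
    return deps

def memory_dependences(trace):
    """A read only depends on memory, so a dependence a -> b between
    processes requires the earlier access a to be a write."""
    deps = set()
    for (i, a), (j, b) in combinations(enumerate(trace), 2):
        if a[0] != b[0] and a[2] == b[2] and a[1] == "write":
            deps.add((i, j))
    return deps

print(sorted(replay_dependences(trace)))
print(sorted(memory_dependences(trace)))
```

In this trace the read by P2 that precedes the write by P1 creates a dependence only under the first definition.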

A more relaxed dependency model than that for message-passing can be used for rollback recovery
only if there is no possibility of deadlock due to processes waiting for messages that may never arrive. In
bus-based systems, where bounded transmission delay eliminates the need for acknowledgements, deadlock
is avoided. In DSM systems, however, other measures have to be taken to avoid messaging deadlock. In
the bus-based recovery scheme of Banâtre et al., a dependence is recorded between any processor that writes
a data item and another that reads it. A bi-directional dependence is recorded between two processors that
write a data item consecutively [9]. Wu et al.'s recovery scheme takes a checkpoint of the originating
process and/or the data item accessed on every data transfer between processing nodes [22, 23]. The effect
of both schemes is to conform to the dependence relation described by Gunaseelan and LeBlanc [10]. A
previous paper by us also used similar assumptions about shared-memory dependences to reduce the number
of checkpoints needed to avoid rollback propagation in DSM systems with a relaxed memory consistency
model [11]. In this paper we validate the dependence assumptions made in previous work, by showing how
dependences can be removed in a DSM system without affecting correct execution after rollback.


2  Motivation 

To motivate our approach to reducing overhead by decreasing the number of dependence-carrying messages,
we present an example of a read and a write access in a typical DSM system. The system consists of a
number of processing nodes which cache copies of the shared variables they reference. The shared memory
is divided into blocks, each of which has a home node to manage ownership rights. As shown in Figure 1,
the memory block p is managed by node M and accessed by nodes A and B. The example uses Li's fixed
distributed manager algorithm for shared virtual memory to maintain coherence [14]. Since their operation
is similar, the example also applies to hardware DSM systems using a distributed directory based coherence
protocol [1, 13].

In our example in Figure 1, memory block p, which contains variable x, is originally cached in a
read-only state by nodes A, B and C. A write to variable x by the user process on node A causes it to send the
new value of x to its local server for block p and wait. If this server had write permission, it would update
its state and immediately return an acknowledgement to the user process. Instead, it sends a message asking
for write access to the manager for block p on node M. The manager forwards the request to the owner of
p, in this case node B, and then notes that node A is now the owner of p. The server on node B sets its
access permission to none for block p, and sends a copy of block p, together with a list of all nodes that have
access to it, to node A. Upon receipt of the message, the server on node A knows that C still has access to


Figure 1: Client-server interactions in distributed shared memory.


the block, so it sends an invalidation message to node C. The server on node C erases its access permission
to p and returns an acknowledgement. Once the server on node A receives the acknowledgement, it notifies
the user process, which resumes execution.

When the user process on node B later decides to read variable x, it sends a read request to its server
for block p. Since it does not have read permission, this server asks the manager of p for a readable copy.
Since there is a possibly dirty copy of p on node A, the manager forwards the request to node A. The server
on node A decreases its access permission to read-only, notes that node B now has access to p, and sends a
copy of p to the server of node B. Upon receiving its copy of p, the server on node B returns the value of
variable x to the user process.

One way to ensure that the DSM system in our example rolls back to a consistent state is to treat every
message sent between nodes as a dependence-carrying message. In our example, the system would have to
track 8 dependences between 4 processing nodes for the two memory accesses shown. Assume that, after
completing the read access, node B detected an error, rolled back, and restored its state from a checkpoint it
took before the read access. After node B's rollback, the global state of the system would be that indicated
by the dashed recovery line in Figure 1. Because the resultant global state reflects the receipt of a message
by node M, but not its sending by node B, it is not consistent in terms of message passing. To maintain a
consistent state, both nodes A and M would also have to roll back to a point before the read.
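The walk-through above can be replayed as a list of inter-node messages. The sketch below simply hard-codes the messages named in the example (the tuple layout and the purpose labels are our own) and counts them, which reproduces the 8 dependences between 4 nodes.

```python
# Inter-node messages from the example, as (sender, receiver, purpose).
# Traffic between a client and its local server stays on one node
# and is therefore not listed.
write_by_A = [
    ("A", "M", "request write access to p"),
    ("M", "B", "forward request to owner"),
    ("B", "A", "copy of p plus list of nodes with access"),
    ("A", "C", "invalidate p"),
    ("C", "A", "acknowledge invalidation"),
]
read_by_B = [
    ("B", "M", "request readable copy of p"),
    ("M", "A", "forward request to owner"),
    ("A", "B", "copy of p"),
]

messages = write_by_A + read_by_B
nodes = {n for sender, receiver, _ in messages for n in (sender, receiver)}

print(len(messages), "dependence-carrying messages between", len(nodes), "nodes")
```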

However, in terms of shared memory, there is no reason for the state indicated by the dashed line to
be inconsistent. In fact, nodes A and M only supplied data in response to the request for read access; they
did not change their internal state. Therefore, rolling them back is superfluous. The request messages from
node B to node M and from node M to node A therefore do not carry any permanent dependences. The
reply message supplies a block of data from node A to node B and therefore does carry a dependence. In
the message-passing model, since this message crosses the recovery line, either it has to be retrievable after
rollback (for instance, by logging it), or the dependence has to be considered bi-directional [6, 12, 20].
However, after rollback node B can simply send a new request for block p when it re-executes the read
access. Therefore the dependence from node A to node B can be considered unidirectional, even if the
message cannot be retrieved after rollback.

The only permanent dependence in our example is from node A to node B. However, there are temporary
dependences that can cause incorrect execution or deadlock when nodes roll back while a request is being
serviced. Consider the global state indicated by the dotted line in Figure 1. This state corresponds to node A
detecting an error while servicing the read request and rolling back to a checkpoint taken prior to servicing
the read request. In this case the two request messages have to be considered as dependences and nodes B
and M need to roll back. Otherwise node B would wait forever to receive a reply from node A after it sent its
request for read access. However, if node A had rolled back after servicing the read request, node B would
have already received its reply from node A, and there would not be a consistency problem. Therefore the
request messages introduce a temporary but not a permanent dependence.

From the example just discussed, it is clear that not all message-passing dependences need to be
considered for correct rollback in a DSM system. Every message introduces a temporary dependence. But
because of the request-reply nature of communication in DSM, some of the temporary dependences disappear
after an interaction between processors is complete. To determine which messages carry permanent
dependences, we need to develop a model of communication in DSM systems. Such a model is described in the
next section.


3  A Passive Server Model of Distributed Shared Memory

We model program execution in DSM systems as a set of client processes which run the application
program and a set of passive server processes which provide a shared-memory image to the clients. The
servers are considered passive since they only change state due to interaction with a client. Processes
communicate via messages sent through reliable channels. We represent the overall program execution by a
pair, $P = (E, \xrightarrow{pd})$, where $E$ is a set of events and $\xrightarrow{pd}$ is the possible dependence relation defined over $E$.
Events within a process are ordered by the $\xrightarrow{xo}$ (execution order) relation. Every event represents an atomic
action which may change the state in one of the processes. A special checkpoint event can be inserted
between two events to record the current state of the process.

In the clients, events can be either internal events, read events, or write events. Internal events only
depend on and affect the local state of the process. Read events send a read request to a local server, wait
for a reply with the value of a data item and then update their local state with the value. Write events send
a write request with a value to a local server and wait for a reply. Events in servers are always triggered
by the receipt of a request message, either from a client or another server. Request messages are handled
by the servers in FIFO order. After the request message is received, server events may send and receive
additional messages. A write or read event in a client, together with the events it causes in the servers, may be
collectively called an interaction. Sends and receives are ordered by the $\xrightarrow{m}$ relation: $a \xrightarrow{m} b$ means event
$a$ sent a message and event $b$ received it. The $\xrightarrow{pd}$ relation is the union of the other two:
$\xrightarrow{pd} \;=\; \xrightarrow{xo} \cup \xrightarrow{m}$.
Figure 2 presents the two interactions in the example of Section 2 in terms of the $\xrightarrow{pd}$ relation between
events.
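As an illustration of the model, the relations can be represented as sets of event pairs. The encoding below is a hypothetical sketch: events are (process, index) pairs, index 0 stands for some earlier event in each process, and index 1 stands for the atomic server event of the read interaction from Section 2.

```python
# Events are (process, index) pairs; index is the position within the process.
events = [(p, i) for p in ("A", "B", "M") for i in (0, 1)]

def execution_order(events):
    """Pairs (a, b) where a and b belong to the same process and a comes first."""
    return {(a, b) for a in events for b in events
            if a[0] == b[0] and a[1] < b[1]}

# Message relation: (a, b) means event a sent a message that event b received.
message = {
    (("B", 1), ("M", 1)),   # read request to the manager
    (("M", 1), ("A", 1)),   # request forwarded to the owner
    (("A", 1), ("B", 1)),   # copy of the block returned
}

# The possible dependence relation is the union of the two.
possible_dependence = execution_order(events) | message
```

Because a server's whole response is a single atomic event, the server event on node B both sends the request and receives the reply.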
Figure 2: Write and read accesses in the passive server model.


When a process detects an error, it notifies all other processes and rolls back to a checkpoint. Upon
receiving notification of a rollback, a process can either roll back to a checkpoint, or continue operation. If
it continues, we can treat the current volatile state as a virtual checkpoint [21]. A global recovery line is
a set of real and virtual checkpoints, one per process. Consider two events $a$ and $b$, where $b$ occurs in the
execution order before the global recovery line and $a$ occurs in the execution order after the global recovery
line. A global checkpoint is consistent if there are no two such events such that $a \xrightarrow{m} b$ or $b \xrightarrow{m} a$. A
global checkpoint is also consistent if lost messages can be retrieved during reexecution and there are no
two events such that $a \xrightarrow{m} b$.
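The consistency condition can be written as a small check. This is a minimal sketch under our own encoding assumptions: events are (process, index) pairs, the recovery line gives for each process the number of events that lie before the line, and `logged` is the set of messages that can be retrieved during reexecution.

```python
def is_consistent(messages, recovery_line, logged=frozenset()):
    """messages: set of (send_event, receive_event) pairs.
    recovery_line: dict mapping process -> number of events kept.
    logged: messages that can be retrieved during reexecution."""
    def kept(event):
        process, index = event
        return index < recovery_line[process]

    for send, receive in messages:
        if kept(receive) and not kept(send):
            return False   # message was received but its sending is undone
        if kept(send) and not kept(receive) and (send, receive) not in logged:
            return False   # message is lost and cannot be retrieved
    return True

# Read interaction of Section 2: only node B rolls back past the read.
read_msgs = {
    (("B", 1), ("M", 1)),
    (("M", 1), ("A", 1)),
    (("A", 1), ("B", 1)),
}
print(is_consistent(read_msgs, {"A": 2, "B": 1, "M": 2}))
print(is_consistent(read_msgs, {"A": 1, "B": 1, "M": 1}))
```

The first call reports the message-passing inconsistency discussed in Section 2; the second corresponds to rolling back all three nodes.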

To simplify reasoning about consistency of global checkpoints it is useful to treat the $\xrightarrow{m}$ relation as
bidirectional. To do this we replace every dependence $a \xrightarrow{m} b$ by a causal dependence $a \xrightarrow{c} b$ and a
logging dependence $b \xrightarrow{l} a$. Consider again two events $a$ and $b$, where $b$ occurs in the execution order
before the global recovery line and $a$ occurs in the execution order after the global recovery line. The
requirements for consistency are now that there are no two such events such that $a \xrightarrow{c} b$ and there are no
two such events such that $a \xrightarrow{l} b$ and the message between $a$ and $b$ is unlogged. As was shown by the
example in the previous section, because of the structure of events and messages, some of the $\xrightarrow{c}$ and $\xrightarrow{l}$
dependences may be eliminated. By examining the various events in a specific DSM system, we can derive
new $\xrightarrow{c}$ and $\xrightarrow{l}$ relations that are subsets of the old.

4  Recovery to a Consistent State in Distributed Shared Memory

We now analyze the fixed distributed manager scheme for implementing DSM systems using the passive
server model. The model requires that all execution by a server done in response to a request constitute one
atomic event. First we show how to satisfy this requirement in a recoverable DSM system. Then we map
all the interactions that occur in the fixed distributed manager scheme into the passive server model. We
show that redundancies in a DSM system allow eliminating many of the dependences between processing
nodes. We consider two methods of taking advantage of information redundancy which result in different
patterns of dependences. In the volatile ownership method, when ownership information is lost by a block's
manager, it can be reclaimed through a broadcast to all the nodes. In the volatile access rights method,
when information about access rights is lost by a node server, it can be reclaimed by contacting the block
manager.

4.1  Maintaining atomicity of server events

To be able to avoid issues of deadlock caused by partially completed server operations, the passive server
model treats a server's whole response to a request message as an atomic event. To maintain atomicity, a
process should never roll back to a recovery line with partially completed events. Therefore all checkpoints
need to be constrained to occur only outside of interactions. Furthermore, if a process participating in an
interaction rolls back, all other processes currently in the interaction also need to roll back.

Interactions are relatively short-lived events; most of the time clients are executing the application and
servers are waiting to process the next request. A simple and low-overhead scheme to handle the case where
an error does occur in an interaction is to simply roll back every process that is involved in an interaction
when it receives notice that a recovery is necessary.
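A minimal sketch of this rule follows, assuming a hypothetical `Process` class whose `in_interaction` flag is set by the protocol code while a request is being serviced. It covers only the atomicity rule; whether a process outside an interaction must roll back because of dependences is a separate decision.

```python
class Process:
    """Hypothetical process that tracks whether it is inside an interaction."""

    def __init__(self, name):
        self.name = name
        self.in_interaction = False
        self.state = {}
        self.checkpoint = {}

    def take_checkpoint(self):
        # Checkpoints are only allowed outside of interactions.
        if self.in_interaction:
            raise RuntimeError("cannot checkpoint inside an interaction")
        self.checkpoint = dict(self.state)

    def on_recovery_notice(self):
        """Roll back if caught inside an interaction, otherwise keep running.
        Returns True when the process rolled back."""
        if self.in_interaction:
            self.state = dict(self.checkpoint)
            self.in_interaction = False
            return True
        return False
```

Since every process that is inside any interaction rolls back on notice, no recovery line can contain a partially completed server event.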
Figure 3: Possible interactions during accesses in a fixed distributed manager DSM.

4.2  Node interactions in the passive server model


Figure 3 presents all the events that can occur in the fixed distributed manager scheme during interactions
between the servers on every node. Since a client process only ever communicates with a server on its
own node, the read or write event on the client is not shown. The dependences between the events on the
different nodes are shown in terms of the $\xrightarrow{pd}$ relation. Next to the graphical representation of an event, the
state changes of that event are shown.

In addition to the state of the block, a block server on a node maintains three pieces of information.
The access rights determine what kind of accesses a node is permitted to make. A node can either have
read-write access (W), read-only access (R), or no access (I) to a block. The copyset (indicated by "cs" in
the Figure) is used to eliminate the need for broadcasting when all copies of a block need to be invalidated.
It keeps a record of all nodes that might possibly have read-only access to a block. The server on a node
also keeps track of whether or not it is the current owner of a block. The manager for a block only maintains
one piece of information, the identity of the current owner of the block.
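The per-block state described above could be laid out as follows. The class and field names are our own and are meant only as a sketch of the information each party keeps.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set

class Access(Enum):
    W = "read-write"
    R = "read-only"
    I = "no access"

@dataclass
class BlockServerEntry:
    """State kept for one block by the block server on one node."""
    data: bytes = b""
    access: Access = Access.I
    # Nodes that might possibly hold a read-only copy ("cs" in the figure).
    copyset: Set[str] = field(default_factory=set)
    is_owner: bool = False

@dataclass
class ManagerEntry:
    """State kept for one block by its manager."""
    owner: Optional[str] = None   # identity of the current owner
```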
4.3  Redundancy in DSM


Some of the possible dependences in a DSM system can be eliminated because of redundancies between
nodes in the maintenance of data. There are a few specific attributes of DSM that enable the removal of
dependences. We make several assumptions about the operation of DSM in our model. The first assumption
is that a non-owned read-only block can be spontaneously invalidated. We also assume that write permission
can be spontaneously taken away from a block to make it read-only as long as its contents are backed up
on another node. A third assumption that needs to be made is that the copyset on a node may contain a
superset of all the other nodes that have a read-only copy of a block. When an invalidate request is sent to
a block server that does not have R access, the request is denied. It is not difficult to design a DSM system
that conforms to the above assumptions. In fact, these assumptions need to be made in any DSM system
where cache space on the nodes is limited. In such a system it is necessary to be able to invalidate recently
unused blocks to make room for blocks of new data.

There is further redundancy between the ownership information maintained by the block manager and
the access right information maintained by the block servers. The ownership information maintained by
the block managers need not always be correct for correct operation of the protocol. If the manager routes
a request for access to a block server that does not have ownership, the server denies the request, and
the manager finds the owner through a broadcast to all nodes. Therefore it is possible to allow volatile
ownership of blocks after rollback. Ownership information is not checkpointed, and if a manager needs to
roll back, the ownership field is set to unknown. The first access to the block after the rollback will then
cause the correct owner to be found through broadcast.
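The owner lookup with a broadcast fallback can be sketched as below. The function name, the `is_owner` mapping (which stands in for asking each block server whether it owns the block), and the message count are illustrative assumptions, not part of the protocol description.

```python
UNKNOWN = None  # value of the manager's owner field after a manager rollback

def find_owner(recorded_owner, is_owner, nodes):
    """Return (owner, messages_sent) for one access request.
    is_owner: dict mapping node -> True if its block server owns the block."""
    messages = 0
    if recorded_owner is not UNKNOWN:
        messages += 1                      # route request to the recorded owner
        if is_owner[recorded_owner]:
            return recorded_owner, messages
        # The recorded node denied the request: fall through to broadcast.
    messages += len(nodes)                 # broadcast to all nodes
    owners = [n for n in nodes if is_owner[n]]
    return (owners[0] if owners else None), messages

nodes = ["A", "B", "C"]
is_owner = {"A": True, "B": False, "C": False}
print(find_owner("A", is_owner, nodes))      # correct ownership information
print(find_owner(UNKNOWN, is_owner, nodes))  # owner field lost in a rollback
print(find_owner("B", is_owner, nodes))      # stale ownership information
```

The lookup returns no owner when no block server claims the block, which is the situation that keeps some dependences from being eliminated.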

Alternatively, one can always maintain correct ownership information in the manager, but allow volatile
access rights. In this case, access rights and ownership information in the block servers need not be restored
after rollback [17, 23]. If a node block server needs access to a block after rollback, it contacts the block
manager. The block manager then forwards the request to the node that it considers the owner of the block.
This will always be the node that had ownership last before the recovery line.
One final redundancy in the DSM system is the copyset. It is used only to determine which nodes have
valid copies when a block needs to be invalidated on all nodes. If the copyset is not available, an invalidate
request can simply be broadcast to every node. Therefore, in both the volatile ownership and the volatile
access rights cases, the copyset maintained by the block servers can also be considered volatile.
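A sketch of how a volatile copyset might be handled is given below. Both helper names are hypothetical; a copyset of `None` represents one that was lost in a rollback, and access rights are written as the letters W, R and I used in Section 4.2.

```python
def invalidation_targets(copyset, all_nodes, self_node):
    """Nodes that receive an invalidate request. When the copyset is not
    available, the request is broadcast to every other node."""
    if copyset is None:
        return sorted(n for n in all_nodes if n != self_node)
    return sorted(n for n in copyset if n != self_node)

def handle_invalidate(access):
    """Block server side: a read-only copy is invalidated; a server without
    R access denies the request. Returns (new_access, reply)."""
    if access == "R":
        return "I", "ack"
    return access, "denied"

all_nodes = {"A", "B", "C"}
print(invalidation_targets({"C"}, all_nodes, "A"))
print(invalidation_targets(None, all_nodes, "A"))
```

Because servers without R access simply deny the request, a broadcast or an over-approximate copyset costs extra messages but leaves the resulting state unchanged.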


4.4  Eliminating dependences with volatile ownership

Through analysis of the specific interactions in the volatile ownership case, where ownership and copyset
information is lost after rollback, it is possible to eliminate some of the message-passing dependences.

Consider a remote read access to a clean block in the fixed distributed manager protocol (Figure 3a).
The message-passing dependences between local (L), manager (M), and remote (R) nodes are: $L \xrightarrow{c} M$,
$M \xrightarrow{l} L$, $M \xrightarrow{c} R$, $R \xrightarrow{l} M$, $R \xrightarrow{c} L$, and $L \xrightarrow{l} R$. The state of node R is not affected by the
interaction, except for the addition of a member to the copyset. If node R rolls back, only the copyset
information is lost and can be recovered through broadcast. So $R \xrightarrow{l} M$ can be eliminated. If nodes L and
M roll back, the state of the recovery line is the same as if all three nodes rolled back, except for the extra
member of the copyset. Since one of our assumptions allows extra members of the copyset, the resulting
state is consistent. So $M \xrightarrow{c} R$ can be eliminated. Since node M does not change state at all during the
interaction, $L \xrightarrow{c} M$ and $M \xrightarrow{l} L$ can also be eliminated. The only remaining dependences are between
node R and node L. Assume node L rolls back. The recovery line is equivalent to the earlier treated case
where both nodes L and M roll back, so it is consistent. Thus $L \xrightarrow{l} R$ is eliminated. The final causal
dependence, $R \xrightarrow{c} L$, cannot be eliminated since a block of data is transmitted from node R to node L.

Now consider a remote read access to a dirty block (Figure 3b). Node M merely forwards the request
and does not change state. Therefore the only possible dependences are between nodes L and R. If node L
rolls back, node R has read-only access to a modified block. However, node R is still the owner of the block,
so any further requests for the block will always be routed to and serviced by node R. Therefore $L \xrightarrow{l} R$
can be eliminated. Again, the causal dependence $R \xrightarrow{c} L$ has to be maintained.
Figure 4: Reducing dependences on remote reads with volatile ownership.

Figure 4 shows the dependences on read access in the message-passing model, and with our model.
They have decreased from three causal and three logging dependences between three nodes to just one
causal dependence between two nodes.
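The reduction for a remote read can be retraced mechanically. The sketch below is illustrative only: it derives the message-passing dependences from the three messages of the interaction and then removes the ones that the argument above eliminates. A pair (x, y) is read as "a rollback of x forces a rollback of y".

```python
# Messages of a remote read: request L -> M, forward M -> R, data R -> L.
messages = [("L", "M"), ("M", "R"), ("R", "L")]

# Message-passing model: every message a -> b yields a causal dependence
# (a, b) and a logging dependence (b, a).
causal_deps = {(a, b) for a, b in messages}
logging_deps = {(b, a) for a, b in messages}

# Dependences removed by the volatile ownership argument for read accesses.
eliminated_causal = {("L", "M"), ("M", "R")}
eliminated_logging = {("M", "L"), ("R", "M"), ("L", "R")}

remaining_causal = causal_deps - eliminated_causal
remaining_logging = logging_deps - eliminated_logging

print(len(causal_deps), "causal and", len(logging_deps), "logging dependences before")
print("remaining:", remaining_causal, remaining_logging)
```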

Next, we consider a remote write access (Figure 3c, 3d). Ignoring invalidations, the message-passing
dependences are the same as in the read accesses. In a remote write interaction, node M changes state; it
records the new owner of the block. However, since ownership information in the manager is volatile, we
can again ignore the dependences to node M. If node L rolls back, the recovery line may have no node
claiming ownership of the block, so even with a broadcast no owner will be found. So the dependence
$L \xrightarrow{l} R$ remains. The causal dependence $R \xrightarrow{c} L$ also remains since it transmits a block of data.

If the block is readable by more than one remote node when the local node asks for write access, all the
copies in the remote nodes will be invalidated. One of the assumptions we made earlier is that the system
can handle the spontaneous invalidation of a read-only block when it is not owned by the node on which
it resides. Therefore node L can roll back past an interaction in which it invalidated a remote node R'. At
the recovery line, it will simply appear as if node R' has been invalidated spontaneously. Therefore there
is no dependence $L \xrightarrow{c} R'$. If node R' rolls back, however, there is a possibility that both read-only and
writable copies exist in the system. This is not allowable in the coherence protocol. If the invalidation
message is logged, however, it can be replayed after the remote node rolls back, guaranteeing there exists
Figure 5: Reducing dependences on remote writes and invalidates with volatile ownership.

only a writable copy of the block in the system. So the causal dependence $R' \xrightarrow{c} L$ can be eliminated, but
there remains a logging dependence $R' \xrightarrow{l} L$.

Figure 5 shows the dependences on a remote write access in the message-passing model, and with
our model. The dependences that occur on each invalidation message are shown separately. Our model
decreases the dependences on a remote write access from three causal and three logging dependences
between three nodes to one causal and one logging dependence between two nodes. In addition, for any
invalidation that needs to be performed, it decreases the dependences from two causal dependences to one
logging dependence.
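
The counts quoted above can be tallied for a remote write that triggers a given number of invalidations. The following is only an illustrative sketch: the names `DependenceCount`, `COUNTS`, and `remote_write_dependences` are hypothetical, and the sketch assumes that the per-invalidation counts simply add to the counts for the write itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependenceCount:
    causal: int
    logging: int

    @property
    def total(self) -> int:
        return self.causal + self.logging


# (dependences for the write itself, dependences per invalidation),
# taken from the counts stated in the text for Figure 5.
COUNTS = {
    "message_passing": (DependenceCount(3, 3), DependenceCount(2, 0)),
    "volatile_ownership": (DependenceCount(1, 1), DependenceCount(0, 1)),
}


def remote_write_dependences(model: str, invalidations: int) -> DependenceCount:
    """Tally dependences for one remote write under the given model.

    Assumption: invalidation dependences are additive with the write's own.
    """
    base, per_invalidation = COUNTS[model]
    return DependenceCount(
        causal=base.causal + invalidations * per_invalidation.causal,
        logging=base.logging + invalidations * per_invalidation.logging,
    )
```

Under these assumptions, a write that invalidates two remote copies creates ten dependences in the message-passing model and four in the volatile ownership model, and only one of those four is causal.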

4.5 Eliminating dependences with volatile access rights

With different assumptions about which information is lost upon rollbacks, different dependences exist
between nodes. We now analyze the volatile access rights case, where the block servers lose access rights,
ownership, and copyset information upon rollback, but the block managers do retain ownership information.
Upon access to a node that has lost access rights due to rollback, the manager is consulted for information
about the ownership of a node. The manager then requests the page from the node it considers the owner.
This node then supplies the page to the requesting node.
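
The lookup path described above can be sketched as a toy model. This is a minimal sketch under stated assumptions, not the protocol itself: `Manager` and `Node` are hypothetical classes, only read accesses are modeled, and the manager's request to the owner is collapsed into a direct copy.

```python
class Manager:
    """Block manager: retains ownership information across rollbacks."""

    def __init__(self):
        self.owner = {}  # block id -> Node the manager considers the owner


class Node:
    """Block server whose access rights are volatile (lost on rollback)."""

    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.rights = {}  # block id -> "read" or "write"
        self.data = {}    # block id -> block contents

    def rollback(self, checkpoint_data):
        """Restore block contents from a checkpoint; access rights are not saved."""
        self.data = dict(checkpoint_data)
        self.rights = {}

    def read(self, block):
        """Read a block, consulting the manager if access rights are missing."""
        if block not in self.rights:
            owner = self.manager.owner[block]
            if owner is not self:
                # The node the manager considers the owner supplies the page.
                self.data[block] = owner.data[block]
            self.rights[block] = "read"
        return self.data[block]
```

Because a rolled-back node holds no access rights, its first access to a block always goes through the manager, so a stale copy restored from the checkpoint is never read.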

Again we have to consider all the cases of remote accesses. On a remote read access to a clean block
(Figure 3a), the state of manager node M is not affected. Therefore, as in previous cases, the dependences
with node M can be eliminated. When node L rolls back, the state of the recovery line is the same as if
node R also rolled back, except for the extra member of the copyset. Since the copyset is allowed to be
a superset of all the nodes that have readable copies, the recovery line is consistent. So the dependence
L → R can be eliminated. When the remote read access is to a dirty block (Figure 3b), a rollback of
node L will cause a situation where node R has lost write permission without guaranteeing that a copy of
the dirty page has been saved on another node. However, the manager still considers node R the owner, so
any further requests will be supplied from its copy of the block. Therefore the dependence L → R can
again be eliminated. So, on a remote read, there remains only the causal dependence, R → L, from the
remote node to the local node.

On a remote write access (Figure 3c, 3d), without the volatile ownership assumption, it is now
necessary to consider the dependences with the manager. If node L rolls back past the interaction, but
node M does not, node M maintains that ownership belongs to node L, even though node L may have
an out-of-date copy of the block. Therefore the dependence, L → M, between the local node and the
manager cannot be eliminated. If node M rolls back past the interaction, but node L does not, node M
maintains that ownership does not belong to node L, even though node L has the latest copy of the block.
So the dependence M → L also remains. The dependences between node M and node R are removable.
Since there is already a two-way dependence between events on nodes L and R, and between events on
nodes L and M, the two-way dependence between events on nodes M and R is superfluous. If either node R
or node L rolls back, node M also rolls back because of causal dependences R → L and L → M. If node
M rolls back, the logging dependences M → L and L → R assure that the state of node M is restored
correctly.
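
The rollback propagation argument above can be checked with a small reachability computation. This is a sketch under an assumed reading of the notation, in which a causal dependence X → Y means that a rollback of X forces a rollback of Y; the function name `rollback_set` is hypothetical. Logging dependences are left out on purpose, since with a message log they do not force further rollbacks.

```python
from collections import deque


def rollback_set(failed, causal_edges):
    """Return every node that must roll back when the nodes in `failed` do.

    causal_edges: iterable of (x, y) pairs, read as
    "if x rolls back, y must roll back too" (assumed interpretation).
    """
    successors = {}
    for x, y in causal_edges:
        successors.setdefault(x, set()).add(y)

    must_roll_back = set(failed)
    queue = deque(failed)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in must_roll_back:
                must_roll_back.add(nxt)
                queue.append(nxt)
    return must_roll_back
```

With the causal dependences R → L and L → M, a rollback of R or L reaches M, while a rollback of M alone propagates no further, which matches the case analysis in the text.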

The only dependences left to consider are those due to invalidations of remote read copies by the
local node L. Node L can roll back past an interaction in which it invalidated a remote node R' since one
of our assumptions allows non-owned read-only nodes to be invalidated spontaneously. Without volatile
ownership, it was not allowable to roll back node R' past an invalidation without replaying the invalidation
message from a log. However, if ownership is not volatile upon rollback, node R' always gets a new copy
from the owner on the first access to a block. In that case there is no possibility of inconsistency. Therefore
the remaining R' → L dependence is eliminated, resulting in a dependence-free invalidation interaction.

Figure 6: Reducing dependences with volatile access rights

Figure 6 shows the dependences eliminated when using the volatile access rights assumption. As before,
on a remote read access, the three causal and three logging dependences are reduced to just one
dependence. On a remote write access, it is not possible to remove all dependences with the manager. But
all dependences on invalidations are eliminated.

4.6 Implementation Issues

A DSM system consists of a number of processing nodes connected by a high-speed network. In a typical
shared virtual memory (software DSM) system [14], all of a node's block servers and block managers are
implemented in a single entity, inside or outside the operating system kernel. Access right and ownership
information is maintained transparently to the user, in page tables. In some hardware DSM systems,
block server access rights information is kept in the directories of the processor caches, while centralized
directories at the main memory elements are used to store block manager ownership information [1]. In
an alternative hardware approach, directories at the node/network interface serve both as block servers and
block managers [13]. The detection of an error on a processing node forces the rollback of the node,
including its directories and page tables. Similarly, the forced rollback, due to rollback propagation, of any
block server or block manager on a node forces the rollback of the complete node.

When using the volatile ownership checkpointing and rollback method, it is not necessary to checkpoint
ownership information maintained by the block managers. Upon rollback, all ownership table entries are
set to unknown, and necessary ownership information is recovered through broadcast. This may simplify
implementation in hardware DSM, because ownership information in the central directories need not be
checkpointed. A more important advantage of the volatile ownership method is that the block managers
are not involved in any dependences. Therefore ownership tables do not need to participate in dependency
tracking. Rollback propagation is reduced, because there is no possibility of introducing a dependence with
a node through interaction with its block manager.
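
The recovery of ownership information after a manager rollback might look like the sketch below. It is an illustration only: `BlockServer`, `BlockManager`, and their methods are hypothetical names, and the broadcast is stood in for by a loop over all block servers.

```python
UNKNOWN = None  # marker for an ownership entry lost in a rollback


class BlockServer:
    def __init__(self, name, owned=()):
        self.name = name
        self.owned = set(owned)  # ids of the blocks this server owns

    def claims_ownership(self, block):
        return block in self.owned


class BlockManager:
    def __init__(self, servers, owners):
        self.servers = servers
        self.owners = dict(owners)  # block id -> server name or UNKNOWN

    def rollback(self):
        """Ownership is volatile: nothing is restored from a checkpoint."""
        for block in self.owners:
            self.owners[block] = UNKNOWN

    def find_owner(self, block):
        """Return the owner's name, querying every server if it is unknown."""
        if self.owners.get(block) is UNKNOWN:
            for server in self.servers:  # stand-in for a broadcast
                if server.claims_ownership(block):
                    self.owners[block] = server.name
                    break
        return self.owners.get(block)
```

If no server claims the block, `find_owner` returns `UNKNOWN`. That corresponds to the case discussed earlier in which the recovery line has no node claiming ownership, which is why a dependence between the local and remote nodes has to remain under this method.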

The volatile access rights method has the advantage that access right information never needs to be saved
in a checkpoint. Besides the actual computational state of the node processes, only ownership tables need
to be checkpointed. In a shared virtual memory system, where page tables may reside in the system address
space, this simplifies the task of user-level checkpointing of a node. In cache-based systems the cache
directory is usually inaccessible for checkpointing except through special hardware. The volatile access
rights method avoids the need for such special hardware. Invalidations do not cause any dependences under
the volatile access rights method. Since invalidations usually occur on multiple nodes, this reduces the
chance of excessive rollback propagation to many nodes.

Table 1: Address trace characteristics.


5  Dependence  Frequency  Measurements 


To evaluate the effectiveness of our schemes in reducing dependences, we performed trace-driven
simulations with multiprocessor address traces from five parallel scientific programs running on an Encore
Multimax. The traces are from execution on seven processors, and each contains at least 80 million memory
references [19]. Table 1 describes the characteristics of the traces used.

Figure 7 presents the frequency of dependences in the message-passing model, the volatile ownership
model, and the volatile access rights model. Both DSM models reduce the number of dependences
significantly. The volatile ownership model slightly outperforms the volatile access rights model. An
implementation using the relaxed models therefore incurs less dependency tracking overhead. In addition, the
probability of rollback propagation is reduced. Both causal dependences and total dependences are plotted
in Figure 8. In both DSM models the causal dependences are in the majority, but logging dependences take
up a large proportion. An implementation that uses logging will not have to consider logging dependences
during rollback, further decreasing the probability of rollback propagation.


Since a DSM system is a specialized implementation of a message-passing system, it is possible to use
message-passing dependency tracking in implementing checkpointing and rollback error recovery. However,
we have shown that, by analyzing specific properties of every interaction between processes in a DSM
system, it is possible to eliminate some of the message-passing dependencies. Reducing the number of
dependences reduces the overhead of checkpointing methods, and the chance of needing to undo an excessive
amount of work after an error due to rollback propagation. Results from trace-driven simulations illustrate
the effectiveness of our methods.
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